Compressing Inverted Lists
Authors
Abstract
The performance of information retrieval systems is a key issue for large web search engines. Inverted indexes and compression techniques account for part of the performance that web search engines currently achieve. In this paper, we introduce a new class of compression techniques for inverted indexes, the Adaptive Frame of Reference, that provides fast query response time, a good compression ratio, and fast indexing time. We compare our approach against a number of state-of-the-art compression techniques for inverted indexes on three factors: compression ratio, indexing performance, and query processing performance. We show that significant performance improvements can be achieved.
DERI (Digital Enterprise Research Institute), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland.
Acknowledgements: The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
Copyright © 2010 by the authors. DERI TR 2010-12-16.
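As a rough illustration of the family of techniques the abstract refers to, the sketch below applies plain (non-adaptive) Frame of Reference coding to a sorted list of document identifiers. It is not the Adaptive Frame of Reference scheme itself, whose details are not given in this abstract. The function names (to_gaps, from_gaps, for_encode, for_decode) and the single-block layout are assumptions made for the example.

```python
from itertools import accumulate
from typing import List, Tuple


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a strictly increasing list of ids into gaps between neighbours."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps: List[int]) -> List[int]:
    """Inverse of to_gaps: a running sum restores the original ids."""
    return list(accumulate(gaps))


def for_encode(values: List[int]) -> Tuple[int, int, int, bytes]:
    """Encode one block with basic Frame of Reference.

    The block minimum is stored as the reference value, and every entry is
    stored as a fixed-width offset from it. Returns
    (reference, bit width, number of values, packed payload).
    """
    if not values:
        raise ValueError("cannot encode an empty block")
    base = min(values)
    width = max(1, (max(values) - base).bit_length())
    packed = 0
    for i, value in enumerate(values):
        packed |= (value - base) << (i * width)
    n_bytes = (len(values) * width + 7) // 8
    return base, width, len(values), packed.to_bytes(n_bytes, "little")


def for_decode(base: int, width: int, count: int, payload: bytes) -> List[int]:
    """Unpack fixed-width offsets and add the reference value back."""
    packed = int.from_bytes(payload, "little")
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(count)]


if __name__ == "__main__":
    ids = [3, 7, 8, 15, 16, 20, 31, 40]
    base, width, count, payload = for_encode(to_gaps(ids))
    restored = from_gaps(for_decode(base, width, count, payload))
    print(width, len(payload), restored == ids)
```

In this sketch a single bit width is used for the whole block, so one large gap inflates the cost of every entry in it. How block boundaries and widths are chosen is where practical schemes differ.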
Similar resources
Partitioning Inverted Lists for Efficient Evaluation of Set-Containment Joins in Main Memory
We present an algorithm for efficient processing of set-containment joins in main memory. Our algorithm uses an index structure based on inverted files. We focus on improving performance of the algorithm in a main-memory environment by utilizing the L2 CPU cache more efficiently. To achieve this, we employ some optimizations including partitioning the inverted lists and compressing the intermed...
Inverted Index Compression
The data structure at the core of today's large-scale search engines, social networks and storage architectures is the inverted index, which can be regarded as a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by search engines and stringent performance requirements dictated by the heavy load of user queries, the inverted lists often st...
An Asymptotically Optimal Data Compression Algorithm Based on an Inverted Index
The usual method of representing a data sequence drawn from a finite alphabet associates with each location in the sequence the source letter that appears there. An alternate approach is to associate with each source letter the list of locations at which it appears in the data sequence [1]. We present a data compression algorithm based on a generalization of this idea. The algorithm parses the ...
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We int...
Parallel Text Query Processing using Composite Inverted Lists
The inverted lists strategy is frequently used as an index data structure for very large textual databases. Its implementation and comparative performance have been studied in sequential and parallel applications. In the latter, with relatively few studies, there has been a sort of “which-is-better” discussion about two alternative parallel realizations of the basic data structure and algorithms...